REQUIREMENTS-DRIVEN AUTOMATIC CONFIGURATION OF NATURAL LANGUAGE APPLICATIONS

Dan Cristea, Corina Forăscu, Ionuţ Pistol
University “Al. I. Cuza” of Iaşi, Faculty of Computer Science
dcristea@infoiasi.ro, corinfor@infoiasi.ro, ipistol@infoiasi.ro

Keywords: tools and resources, natural language processing, annotation schemes

Abstract: The paper proposes a model for dynamically building architectures intended to process natural language. The representation at the base of the model is a hierarchy of XML annotation schemas in which the parent-child links are defined by subsumption relations. We show how the hierarchy may be augmented with processing power by marking the edges with names of processors, each realising an elementary NL processing step, able to transform the annotation corresponding to the parent node into that corresponding to the child node. The paper describes a navigation algorithm in the hierarchy, which computes paths linking a start node to a destination node, and which automatically configures architectures of serial and parallel combinations of processors.

1 INTRODUCTION

In this paper we propose a methodology that allows for the automatic configuration of architectures of serial and parallel combinations of natural language (NL) processors, each able to perform an elementary transformation on an input file. The inputs and outputs of the modules (between the processing steps) are XML annotated files.

GATE (Cunningham et al., 2002, 2003) is an extremely versatile environment for building and deploying NLP software and resources. It allows for the integration of a large number of built-in components into new processing pipelines that can be put to work on single documents or corpora. In order to build a pipeline, the user is instructed to select the modules (called resources in GATE) needed as parts of the processing chain, in the correct processing order, and to instantiate their parameters. When all this is done, the configured chain of processes may be put to work on an input file, with the result of obtaining an XML annotated output file. The model we propose comprises a combination of processing steps and filtering steps. The processing steps add information, while the filtering steps remove information.

Our approach is based on Cristea and Butnariu’s (2004) hierarchy of annotation schemas. In this model, XML annotation schemas are nodes in a directed acyclic graph, and the hierarchical links are subsumption relations between schemas. The model allows classification, simplification and merging operations to be performed on files observing the restrictions of the annotation schemas that are comprised in the hierarchy. We describe how the graph may be augmented with processing power by marking edges linking parent nodes to daughter nodes with names of processors, each realising an elementary NL processing step. On the augmented graph, three operations are defined: simplification, pipeline and merge. We then present a navigation algorithm in this hierarchy, which computes paths between a start node, corresponding to an input file, and a destination node, corresponding to an output file. To these computed paths correspond sequences of operations, which are equivalent to architectures of serial and parallel combinations of processors.

When an input file is given to a system that implements these principles, and the requirements of an output annotation are specified as the destination node, the following steps take place: first the XML annotation schema of the input file is determined; then this schema is classified onto the hierarchy, becoming the start node; then the expression of operations corresponding to the minimum paths linking the start node to the destination node is computed (the architecture); and finally the input file is given to this architecture, resulting in the expected output file.

Section 2 of the paper reviews the hierarchical model of annotation schemas, while section 3 presents the hierarchy augmented with processing power. In section 4, the operations associated with the augmented graph are defined. In section 5, the algorithm that computes the sequence of operations is given, together with a brief presentation of an implementation that operates on these principles, while section 6 discusses the feasibility of the approach in practical settings.

2 THE GRAPH REPRESENTATION OF ANNOTATION SCHEMAS

In (Cristea and Butnariu, 2004), different layers of annotation over a corpus are represented as a hierarchy of annotation schemas. A node in the hierarchy is a set of XML tags, each characterised by a name, a list of attribute names, and possible restrictions denoting […] of attributes of other tags in the hierarchy. The parenthood relationship places the schemas described in this way in a hierarchy, which is a directed acyclic graph whose node names are unique symbols. If a node A is directly linked to a node B, then it is said that A subsumes B in the hierarchy (therefore B is a descendant of A). This happens if and only if:
- any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B;
- any restriction which holds in A also holds in B.

Figure 1: Example of a hierarchy of schemas (adapted after an example from (Cristea and Butnariu, 2004))

Figure 2: Example of annotation (XML markup of the text “Winston was dreaming of his mother. He must, he thought, have been ten or eleven years old”)

The subsumption relation indicates that each node in the hierarchy inherits all features (seen here as tags and their attributes) of all of its parents. So, if A subsumes B, B is an annotation schema which is more informative than A and/or defines more constraints. In general, either B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A, and/or there is at least one constraint which holds in B but which does not hold in A. The subsumption relation is transitive, reflexive and antisymmetric.

Figure 1 shows an example of a hierarchy of schemas describing different layers of annotation useful for many NLP applications. SCH-ROOT represents the “empty” annotation (no tags). Immediately under this trivial schema, three schemas, SCH-TOK, SCH-SEG and SCH-PAR, are placed. SCH-TOK identifies tokens and marks word lemmas, SCH-SEG marks borders between elementary discourse units, while SCH-PAR marks paragraphs. SCH-POS, placed under SCH-TOK, does not contribute new tags, but it complements the token tag with an attribute that indicates the part-of-speech. SCH-POS is a parent for both the SCH-NP and SCH-VP schemas. These mark noun phrases (NPs) and, respectively, verb phrases (VPs). Then, SCH-COREF, placed under SCH-NP, marks anaphoric links between co-referential NPs. SCH-SEG-NP-VP is a schema marking simultaneously noun phrases, verb phrases and discourse unit boundaries. It adds no new markings to those inherited from its three parents. Finally, SCH-COREF-IN-SEG is a schema in which the co-references and segment boundaries are marked.

An example of an XML file observing the restrictions of the SCH-COREF-IN-SEG node is shown in Figure 2.

3 MARKING THE EDGES OF THE GRAPH WITH PROCESSOR NAMES

Modern software engineering design uses interchangeable modules, which are interconnected in complex processing architectures. In NLP, this approach has proven advantages regarding reusability and language and application independence. In such a view, each module has inputs and outputs, and accesses resources. In order for the modules to be truly interconnectable, each of the module’s inputs and outputs must observe the constraints of certain annotation schemas. Usually the language dependence and, sometimes, the application dependence of a module is given by the specific set of resources it accesses. For instance, a POS-tagger runs the same algorithms on different sets of language models in order to tag documents for POS in different languages. For the system builder, the real functioning of a module can be obscured in a black box, since it is fully determined by the triplet: input, output and resources. This is equivalent to saying that, given a triplet of schemas characterising the input, the resources and the output, there should exist a module which produces as output a file observing the restrictions of the output schema, whenever it receives as input a file observing the restrictions described by the input schema, and accesses resources observing the resources schema.

This way, the hierarchy of annotation schemas becomes a graph of interconnecting modules. More precisely, if a node A subsumes a node B (see Figure 3), there should be a process which takes as input a file observing the restrictions imposed by the node A and produces as output a file observing the restrictions imposed by the node B. While doing this type of processing, the module might also make use of some resources. However, in our graphical notations and the considerations to follow, only the input-output relations will be retained.

Figure 3: Equivalence between the subsumption relation and a processing step

Let us note that the directionality of a process, as attached to an edge of the graph, is that of the subsumption relation. So, if node A subsumes node B, then the hierarchical link is from A to B (from the parent to the descendant). In our figures this will be marked by oriented edges (arrows). To mark the difference between edges denoting subsumption relations and edges denoting processing, we will mark the subsumption relations with thin arrows and the processing steps with labelled thick arrows, where the labels indicate the names of the processes. We will call a graph of annotation schemas on which processing modules have been marked on edges augmented with processing power (or simply augmented).

Sometimes the existence of a process attached to an edge in the graph depends on the existence of adequate resources. For instance, one may have access to an automatic tagger, but will not be able to apply it to a language L because of the lack of a language model (a resource) adequate for that language. This way, in a repository of resources and instruments dedicated to NLP, the maximal graph of annotation schemas hosted can have different instantiations for different languages, depending on the existence (or absence) of adequate resources.

Figure 4 represents the hierarchy of schemas from Figure 1, augmented with processing power. The names of processes are marked on some edges, and the symbol ∅ marks the empty process (no contribution of new tags/attributes).

Figure 4: The hierarchy in Figure 1 augmented with processing power (edge labels: tokeniser, segmenter, paragraph finder, POS-tagger, NP-chunker, VP-chunker, AR, ∅)

4 OPERATIONS IN THE AUGMENTED GRAPH

Whereas navigation in the hierarchical graph of schemas allows files to be classified in the hierarchy and annotations to be simplified and merged, as described in (Cristea and Butnariu, 2004), navigation in the graph augmented with processing power allows the automatic identification of processing steps. Any process resulting from this computation is a combination of serial processing with merges. Unlike GATE, which allows only pipeline processing, in which the whole output of the preceding processor is given as input to the next processor, in our model a combination of branching pipelines may result.

Once the computation of steps is done using the hierarchy augmented with processing power, the computed process can be applied to an input file and, eventually, it produces an output file. These files comply with the restrictions encoded by a start node and, respectively, a destination node of the hierarchy.

A processing task is defined by a pair of annotation schemas, start and destination. Transposed on the processing graph, provided the two schemas are represented as nodes in the graph, and since the graph is connected, there should always be at least one path connecting these two nodes. The paths found are made up of oriented edges and, as we will see, it is important whether or not the orientation of the edges is the same as that of the path.

We will describe later in this section three operations associated with the computed paths. Due to these operations, the otherwise static set of alternative paths linking a start node to a destination node determines a set of alternative processing paths or flows, which represent dynamically configured architectures. There are two ways to look at flows as processes: as applied to nodes of the graph and as applied to files. A flow transforms an input (node or file) by adding or deleting some mark-ups, seen as definitions in a node (schema) and as actual annotation in a file. The term “flow” comes easily if we imagine that the information actually “flows” through the edges of the graph, while also producing changes in the input files. Different examples of flows, linking start nodes with destination nodes, are sketched in Figure 5 in thin, interrupted arrows.

Figure 5: Examples of processing paths (in thin, interrupted arrows); panels a, b and c

More precisely, a flow must be seen as summing up sequences of processing steps. We will denote by f(x) the flow applied to the input node, or file, x. So, y=f(x) means either that the destination node y is obtained by applying the flow f to the start node x, or that the output file y results from the application of the flow f to the input file x. All three operations which will be defined below produce flows. Trivially, an empty flow leaves the input unmodified. We will denote the empty flow by f∅. So, f∅(x)=x, for any node or file x. The way in which the computation of flows is defined (in section 5) ensures that, if A and B are the start and destination nodes in a graph, then exactly one flow f exists such that B=f(A). Flows can be combined, such that it is possible to have B=f1(f2(A)). This notation inspires the generalisation of a flow as applying to flows instead of nodes or files. Indeed, if a node B is placed along a path from A to C, we may say that the flow that transforms A into B combines with the one that transforms B into C to produce the flow that transforms A into C. As such, we may see the input of the second flow as being the first flow, instead of the intermediary node.

In the following we give the definitions of the promised three operations: simplification, pipelining and merging.

We say that a node B is simplified to A, and we write A=S_A(B), if B and A are both placed on a path from the start to the destination node, in this order, and A subsumes B. When a file corresponding to the schema B is simplified to A, it will lose all annotations except those imposed by the schema A. In Figure 5a, we have B=S_B(A), and in Figure 5b it holds that C=S_C(A) and E=S_E(A). In accordance with our discussion on flows above, we may look at S_B as a flow, which allows us to apply it to a node, to a file or to another flow.

If, on a path from the start node to the destination node in the augmented graph, there is an edge linking a node A to a node B which is marked with a process p, we say that A pipelines to B by p, and we write B=A>p. Equally, when a file corresponding to the schema A is pipelined to B by p, it will be transformed by the process p into a file that corresponds to the restrictions imposed by B. Finally, according to the discussion on flows, we may consider “pipelines by p” an operation which is applied to a flow to produce a flow. Pipelining with an empty process ∅ leaves the input unmodified, while pipelining the empty flow with a process yields the flow consisting of that process. In Figure 5b it holds that B=C>b and B=E>a>c.

The merge operation can be defined in nodes pointed to by more than one edge of the hierarchical graph. If f1, …, fn are flows entering a node A, we say that flows f1, …, fn merge into A and we write A=f1 | … | fn. For instance, in Figure 5b we have B=C>b | E>a>c. Merging a flow f with the empty flow f∅ leaves f unchanged.

With these definitions, for the graph of Figure 5b it holds that B=S_C(A)>b | S_E(A)>a>c, and for the graph of Figure 5c it holds that B=((A>c | S_C(A)>a>d)>e | S_C(A)>b>f)>g.

5 COMPUTATION OF FLOWS

We give below the Compute-Flow algorithm. The notations used are as follows: function CF(x,y) receives as input a pair of nodes start-destination and returns that flow which, applied to x, outputs y; subsumes(x,y) is a predicate function applied to two nodes, which evaluates to true if and only if x subsumes y on the graph; simplify(x,y) expresses that the node y is simplified in the sense given by the node x and returns a flow; pipeline(f,p), with f a flow and p a process, expresses that the flow f is pipelined with the process p and returns a flow; merge(f1,f2) expresses the merging of flows f1 and f2 and returns a flow; and f∅ is the empty flow.

    function CF(st,de)
      if st equals de then return(f∅);
      else if subsumes(de,st) then return(simplify(de,st));
      else if there is just one node n such that n pipelines to de by a process p
        then return(pipeline(CF(st,n),p));
      else {
        expr := f∅;
        while (there still exists an unprocessed edge p linking a node n to the node de) do
          expr := merge(expr, pipeline(CF(st,n),p));
        return(expr);
      };

Note that, in the above notation of the function CF, the input is a pair of nodes start-destination and the output is a computed flow, as an expression of simplify-pipeline-merge operators. In order to apply the computed flow to an input file, the input file must be given to the result of the computation of a call to the function CF. In what follows, we exemplify with several cases, writing S(x,y), P(f,p) and M(f1,f2) as short notations for simplify(x,y), pipeline(f,p) and merge(f1,f2):

- for the graph depicted in Figure 5a, subsumes(B,A) is true, therefore the call CF(A,B) returns simplify(B,A);

- for the graph in Figure 5c, the recursive evaluation proceeds as follows, in which the ⇒ sign should be read as “evaluates to”:

    CF(A,B) ⇒
    P(CF(A,G),g) ⇒
    P(M(P(CF(A,E),f),P(CF(A,F),e)),g) ⇒
    P(M(P(P(CF(A,C),b),f),P(M(P(CF(A,A),c),P(CF(A,D),d)),e)),g) ⇒
    P(M(P(P(S(C,A),b),f),P(M(P(A,c),P(P(CF(A,C),a),d)),e)),g) ⇒
    P(M(P(P(S(C,A),b),f),P(M(P(A,c),P(P(S(C,A),a),d)),e)),g)

  This is the same expression as the one noted above in an abridged form;

- for the graph in Figure 1, if in the call to the function CF the node st is SCH-ROOT and the node de is SCH-COREF, the meaning of the compute request CF(SCH-ROOT,SCH-COREF) is that, starting from a raw text, one should get annotations for co-referential anaphora, but including also the marking of tokens, their part-of-speech, and the noun phrases (which usually count as referential expressions). The computed flow is, in the abridged notation for pipelines: SCH-ROOT > tokeniser > POS-tagger > NP-chunker > AR, where tokeniser is the process which tokenises a raw text, the POS-tagger adds part-of-speech markings to a tokenised text, the NP-chunker marks NPs on a POS-tagged text and AR is the module doing anaphora resolution on a file having NPs marked;

- for the example in Figure 6a, if the node st is SCH-SEG-NP-VP and the node de is SCH-COREF-IN-SEG, the computation goes as follows:

    CF(SCH-SEG-NP-VP,SCH-COREF-IN-SEG) ⇒
    M(S(SCH-SEG,SCH-SEG-NP-VP),
      P(S(SCH-NP,SCH-SEG-NP-VP),AR))

  The meaning of this expression is to simplify the original file in two ways, and to merge one of the results with the other, to which the AR process has been pipelined;

- for the same graph, in Figure 6b, if the node st is SCH-PAR and the node de is SCH-SEG-NP-VP, the computation goes as follows:

    CF(SCH-PAR,SCH-SEG-NP-VP) ⇒
    M(P(CF(SCH-PAR,SCH-NP),∅),
      P(CF(SCH-PAR,SCH-VP),∅),
      P(CF(SCH-PAR,SCH-SEG),∅)) ⇒
    M(P(CF(SCH-PAR,SCH-POS),NP-chunker),
      P(CF(SCH-PAR,SCH-POS),VP-chunker),
      P(CF(SCH-PAR,SCH-ROOT),segmenter)) ⇒
    M(P(P(CF(SCH-PAR,SCH-TOK),POS-tagger),NP-chunker),
      P(P(CF(SCH-PAR,SCH-TOK),POS-tagger),VP-chunker),
      P(S(SCH-ROOT,SCH-PAR),segmenter)) ⇒
    M(P(P(P(S(SCH-ROOT,SCH-PAR),tokeniser),POS-tagger),NP-chunker),
      P(P(P(S(SCH-ROOT,SCH-PAR),tokeniser),POS-tagger),VP-chunker),
      P(S(SCH-ROOT,SCH-PAR),segmenter))

Figure 6: Examples of processing paths (in thin, interrupted arrows) for the hierarchy in Figure 1; panels a and b

The problem with this solution is that parts of the computations are repeated. The redundancy in processing happens because different paths are partly superposed. An abridged architecture, which does not show the null processes, is depicted in Figure 7. Starting from a text with paragraph markings, one should first perform a “simplify” operation, to get the raw text, corresponding to the SCH-ROOT node; then the output should be processed in parallel by a text segmenter and by a chain formed by a tokeniser and a POS-tagger. At this point the process splits again into an NP-chunker and a VP-chunker. The outputs of these two processes are finally merged with the output of the segmenter. The current algorithm does not provide for the simplification of computed architectures.

An implementation of the hierarchy model has been developed for a portal intended to act as a repository for Romanian language resources and NLP tools. The present implementation performs automatic classification, simplification and merging. Presently, a number of locally developed modules have been linked with the edges of an existing graph of schemas. The coupling with the GATE style of processing is under study. The final machinery will develop into a portal offering on-line processing, to which users can send their own files, indicate the desired final annotation and receive the output file.

Figure 7: Computed path between nodes in the annotation schemas hierarchy

6 DISCUSSION

The proposed approach has an apparent drawback, which is the large diversity of annotation schemas that could appear in different applications. This amounts to a huge graph of schemas if the ambition is for exhaustiveness. However, the need for standardisation is evident nowadays and has been very clearly stated in many contexts, for instance (Ide et al., 2003). The more and more common use of international standards for the annotation of documents, such as TEI and CES/XCES, will make widely applied standardisation a reality. Moreover, the Semantic Web, with its tremendous need for interconnection and integration of resources and applications in communicating environments, strongly boosts the appeal of standardisation. It is therefore foreseeable that more and more designers will adopt recognised standards, in order to allow easy interfacing of their applications. The large acceptance of XML as an annotation language, the development of a variety of sublanguages based on XML, and the adoption of encoding standards such as TEI and CES in text processing make this challenge a reality.

But there is another reason for the drawback to be only apparent. We have seen already that, by classification, any schema could be placed in the hierarchy. Of course, classification could increase the number of nodes of the hierarchy in an uncontrolled way. The proliferation could be caused not so much by the semantic diversity of the annotations as by the differences in name spaces (names of tags, attributes and values). Suppose one wants to connect a new file to the hierarchy in order to exploit its processing power. What one has to do is first to classify the file. If the system reports the result as being a new node in the hierarchy, then its position also gives indications of its similarity/dissimilarity with the neighbouring schemas. A visual inspection of the names used can reveal, for instance, that a simple translation operation can make the new node identical to an existing one. This means that the new schema is not new for the hierarchy, although the set of conventions used, which make it different from those of the hierarchy, is imposed by the restrictions of the user’s application. The solution to this incompatibility is not always a despotic attitude vis-à-vis the adoption of new notation conventions, but rather a flexible way of looking at the diversity.

Technically, this can be achieved by temporarily creating links between the new schema, classified by the hierarchy as a new node, and its corresponding standard in the hierarchy. Processing along such a link is different from the usual behaviour associated with the edges of the graph. It describes a translation process, in which the annotation is not enriched, but names of tags, attributes and values are changed. Ideally, the processing abilities of the hierarchy should also include the capability to automatically discover the translation procedure. This task is not trivial, since it would require that the hierarchy “understands” the intentions hidden behind the annotation, displaying an intelligent behaviour which is not easy to implement. This subject could make an interesting direction of further research.

Now, once the entry and exit points in the hierarchy have been determined and translation links have been devised, all the rest is done by the hierarchy itself, augmented with the processing power in the manner described above. This way, the processing needed to arrive from the input to the output is computed by the hierarchy as sequences of serial and parallel processing steps, each of them supported in the hierarchy by means of specialised modules. Then the process itself is launched on the input file. It includes an initial translation phase, followed by a sequence of simplifications, pipelines and/or merges, as described by the computed path, and followed by a final translation, which is expected to produce the output file.

The linguistic annotations can make use of ontologies as formalised schemas specifying what can actually be annotated, and hence the annotation schemas can be considered special cases of semantic annotations with regard to an ontology, such as the one pursued within the context of the Semantic Web. The formalisation of the annotation schema as an ontology, and the use of standard formalisms such as RDF or OWL to encode it, allows for the reuse of the schema across different annotation tools. The linguistic annotation model based on an ontology offers flexibility in the sense that it is general enough to be applied in a broad variety of annotation tasks.

ACKNOWLEDGMENTS

We thank Valentin Tablan and Dan Tufiş for their suggestions and comments during the final stages of the elaboration of this paper.

REFERENCES

Dan Cristea and Cristina Butnariu. 2004. Hierarchical XML representation for heavily annotated corpora. In Proceedings of the LREC 2004 Workshop on XML-Based Richly Annotated Corpora, Lisbon, Portugal.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan. 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, US.

Hamish Cunningham, Valentin Tablan, Kalina Bontcheva, Marin Dimitrov. 2003. Language engineering tools for collaborative corpus annotation. In Proceedings of Corpus Linguistics 2003, Lancaster, UK.

Nancy Ide, Laurent Romary, Eric de la Clergerie. 2003. International standard for a linguistic annotation framework. In Proceedings of the HLT-NAACL'03 Workshop on the Software Engineering and Architecture of Language Technology, Edmonton, Canada.
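
As an illustration of the Compute-Flow function CF of section 5, the following is a minimal Python sketch, not the implementation mentioned in the paper. It rests on several assumptions: the edge list is our reading of Figure 4 and of the worked examples; flows are represented as plain strings in the S/P/M short notation, with S(x,y) following the definition of simplify(x,y) (y simplified to x); subsumes is taken to mean "is a proper ancestor of"; and the names EDGES, EMPTY, parents and cf are hypothetical. The single-parent case of the algorithm is covered by the loop, since merging with the empty flow leaves a flow unchanged.

```python
# Edges of the augmented hierarchy, as read from Figure 4 (an assumption):
# (parent schema, child schema, process). None stands for the empty process.
EDGES = [
    ("SCH-ROOT", "SCH-TOK", "tokeniser"),
    ("SCH-ROOT", "SCH-SEG", "segmenter"),
    ("SCH-ROOT", "SCH-PAR", "paragraph finder"),
    ("SCH-TOK", "SCH-POS", "POS-tagger"),
    ("SCH-POS", "SCH-NP", "NP-chunker"),
    ("SCH-POS", "SCH-VP", "VP-chunker"),
    ("SCH-NP", "SCH-COREF", "AR"),
    ("SCH-NP", "SCH-SEG-NP-VP", None),
    ("SCH-VP", "SCH-SEG-NP-VP", None),
    ("SCH-SEG", "SCH-SEG-NP-VP", None),
    ("SCH-SEG", "SCH-COREF-IN-SEG", None),
    ("SCH-COREF", "SCH-COREF-IN-SEG", None),
]

EMPTY = "f∅"  # the empty flow


def parents(node):
    """Incoming edges of a node as (parent, process) pairs, in EDGES order."""
    return [(a, p) for a, b, p in EDGES if b == node]


def subsumes(x, y):
    """True iff x is a proper ancestor of y (assumed reading of 'x subsumes y')."""
    return any(a == x or subsumes(x, a) for a, _ in parents(y))


def simplify(x, y):
    """Flow that simplifies node (or file) y to the schema x."""
    return f"S({x},{y})"


def pipeline(f, p):
    """Flow f pipelined with process p."""
    if p is None:       # an empty process leaves the flow unmodified
        return f
    if f == EMPTY:      # the empty flow pipelined with p is just p
        return p
    return f"P({f},{p})"


def merge(f1, f2):
    """Merge of two flows; the empty flow is neutral."""
    if f1 == EMPTY:
        return f2
    if f2 == EMPTY:
        return f1
    return f"M({f1},{f2})"


def cf(st, de):
    """Compute the flow that transforms the start node st into the node de."""
    if st == de:
        return EMPTY
    if subsumes(de, st):
        return simplify(de, st)
    expr = EMPTY
    for n, p in parents(de):  # one or several edges entering de
        expr = merge(expr, pipeline(cf(st, n), p))
    return expr


if __name__ == "__main__":
    print(cf("SCH-ROOT", "SCH-COREF"))
    print(cf("SCH-SEG-NP-VP", "SCH-COREF-IN-SEG"))
```

Under these assumptions, cf("SCH-ROOT", "SCH-COREF") evaluates to P(P(P(tokeniser,POS-tagger),NP-chunker),AR), the nested form of the chain SCH-ROOT > tokeniser > POS-tagger > NP-chunker > AR. Because merge is binary here, a node with three parents yields a nested M(M(…,…),…) rather than the three-argument M used in the text.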